Cascading is a software abstraction layer for Apache Hadoop. It is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Concurrent, Inc.〔(Cascading support page)〕 Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc.〔(Concurrent, Inc.)〕 Cascading is actively developed by the community, and a number of add-on modules are available.〔(Cascading modules)〕

==Architecture==

To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, an integration API, a process planner and a process scheduler. Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map and reduce tasks.〔(Blog post by Etsy describing their use of Cascading with Hadoop)〕

Developers use Cascading to create a .jar file that describes the required processes. Cascading follows a ‘source-pipe-sink’ paradigm: data is captured from sources, passes through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’. Pipes are created independently of the data they will process. Once pipes are tied to data sources and sinks, the result is called a ‘flow’. Flows can be grouped into a ‘cascade’, and the process scheduler ensures that a given flow does not execute until all its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.〔(Cascading User Guide)〕

Developers write the code in a JVM-based language and do not need to learn MapReduce.
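The source-pipe-sink idea described above can be illustrated with a small plain-Java sketch. This is not the Cascading API: the class `SourcePipeSinkSketch`, the `Flow` type, the `runCascade` method and the file names are all hypothetical, and an in-memory map stands in for files on a cluster. The sketch only shows pipes defined independently of data, flows that bind a pipe to a source and a sink, and a scheduler that holds a flow back until the data it reads exists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class SourcePipeSinkSketch {

    // Hypothetical flow: a reusable pipe bound to a named source and a named sink.
    static final class Flow {
        final String name;
        final String source;
        final Function<List<String>, List<String>> pipe;
        final String sink;

        Flow(String name, String source,
             Function<List<String>, List<String>> pipe, String sink) {
            this.name = name;
            this.source = source;
            this.pipe = pipe;
            this.sink = sink;
        }
    }

    // Hypothetical cascade runner: executes a flow only once its source is
    // present in the store, and returns the order in which flows actually ran.
    static List<String> runCascade(List<Flow> flows, Map<String, List<String>> store) {
        List<Flow> pending = new ArrayList<>(flows);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            boolean progressed = false;
            Iterator<Flow> it = pending.iterator();
            while (it.hasNext()) {
                Flow flow = it.next();
                if (store.containsKey(flow.source)) {
                    store.put(flow.sink, flow.pipe.apply(store.get(flow.source)));
                    order.add(flow.name);
                    it.remove();
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("Unsatisfied flow dependencies");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Pipes are defined without reference to any particular data.
        Function<List<String>, List<String>> onlyErrors = lines ->
                lines.stream().filter(l -> l.startsWith("ERROR")).collect(Collectors.toList());
        Function<List<String>, List<String>> upperCase = lines ->
                lines.stream().map(String::toUpperCase).collect(Collectors.toList());

        // In-memory stand-in for input and output files.
        Map<String, List<String>> store = new HashMap<>();
        store.put("raw.log", Arrays.asList("INFO start", "ERROR disk full", "ERROR timeout"));

        // Flows are declared out of order on purpose; the runner reorders them.
        List<Flow> flows = Arrays.asList(
                new Flow("shout", "errors.log", upperCase, "report.txt"),
                new Flow("filter", "raw.log", onlyErrors, "errors.log"));

        System.out.println(runCascade(flows, store));   // [filter, shout]
        System.out.println(store.get("report.txt"));    // [ERROR DISK FULL, ERROR TIMEOUT]
    }
}
```

Because the two pipes never mention a file name, the same `onlyErrors` pipe could be bound to a different source and sink in another flow, which is the kind of reuse the text describes.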
The resulting program can be regression tested and integrated with external applications like any other Java application.〔(Concurrent product page)〕

Cascading is most often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, and extract, transform and load (ETL) applications.〔(Concurrent home page)〕

Excerpt source: Wikipedia, the free encyclopedia, "Cascading (software)".